!ls
# might need to install lime in the docker or system I am using
#!pip install lime
from __future__ import print_function  # must come before any other import
from pyspark.sql import SQLContext
from pyspark import SparkContext
import pandas as pd
import numpy as np
import scipy.stats as scs
import statsmodels.api as sm
import matplotlib.pyplot as plt
import lime
import sklearn
import sklearn.ensemble
import sklearn.metrics
%matplotlib inline
%config InlineBackend.figure_format='retina'
df = pd.read_csv('small_descr_clm_code.csv')
df.drop('Unnamed: 0',axis=1, inplace=True)
df.head()
def remove_string(dataframe, column_list, string_in_quotes):
    '''
    Input:
        dataframe: pandas DataFrame
        column_list: list of column name strings (ex. ['col_1', 'col_2'])
        string_in_quotes: string to remove (ex. ',')
    Output:
        None; modifies the DataFrame in place to remove the string.
    Example:
        remove_string(df, ['col_1', 'col_2'], ',')
    Warning:
        If memory issues occur, limit to one column at a time.
    '''
    for i in column_list:
        # regex=False treats the pattern as a literal string
        dataframe[i] = dataframe[i].astype(str).str.replace(string_in_quotes, "", regex=False)
remove_string(df, ['descr'],',')
remove_string(df, ['clm'],',')
remove_string(df, ['descr'],'\n')
remove_string(df, ['clm'],'\n')
df.iloc[0]['clm']
Pandas introduced Categoricals in version 0.15. The category dtype represents a column's values as integers under the hood rather than storing the raw values, with a separate mapping from those integers back to the raw values. This arrangement is useful whenever a column contains a limited set of values: when we convert a column to the category dtype, pandas uses the most space-efficient integer subtype that can represent all of the column's unique values. [citation]
df['code'] = df['code'].astype('category')
df.info()
df.info(memory_usage='deep')
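As a sanity check of the memory claim above, a minimal sketch on synthetic data (not the claims file) shows the category dtype shrinking a repetitive string column:

```python
import numpy as np
import pandas as pd

# A low-cardinality string column repeated many times, like a code column.
s = pd.Series(np.random.choice(['705', '706'], size=10_000))
as_object = s.memory_usage(deep=True)
as_category = s.astype('category').memory_usage(deep=True)
print(as_object, as_category)
# Category storage is small integer codes plus one mapping of unique values,
# so it should be much smaller than 10,000 Python string objects.
assert as_category < as_object
```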
from sklearn.model_selection import train_test_split
X = df['descr']
y = df['code']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)
print(X.shape, y.shape)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
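The stratify=y argument keeps the class proportions of y roughly the same in both splits. A small sketch with hypothetical 80/20 labels:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 of class '705', 20 of class '706'.
y_toy = pd.Series(['705'] * 80 + ['706'] * 20)
X_toy = pd.Series(range(100))
_, _, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.33,
                                    random_state=42, stratify=y_toy)
# Both splits keep roughly the 80/20 ratio of the full data.
print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```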
Let's use the TF-IDF vectorizer, commonly used for text.
from sklearn import feature_extraction
vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(lowercase=False)
train_vectors = vectorizer.fit_transform(X_train)
test_vectors = vectorizer.transform(X_test)
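For reference, a minimal sketch of what TfidfVectorizer produces on a made-up toy corpus (these documents are hypothetical, not from the claims data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF scores terms higher when they are frequent in a document
# but rare across the corpus.
docs = ['claim paid in full', 'claim denied', 'appeal denied']
vec = TfidfVectorizer(lowercase=False)
X = vec.fit_transform(docs)
print(sorted(vec.vocabulary_))   # the learned vocabulary
print(X.shape)                   # (n_documents, vocabulary_size), sparse
```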
Now, let's say we want to use random forests for classification. It's usually hard to understand what random forests are doing, especially with many trees.
rf = sklearn.ensemble.RandomForestClassifier(n_estimators=500)
rf.fit(train_vectors, y_train)
from sklearn.metrics import roc_curve, auc, f1_score, recall_score, precision_score
pred = rf.predict(test_vectors)
sklearn.metrics.f1_score(y_test, pred, average=None)
sklearn.metrics.f1_score(y_test, pred, average='weighted')
# average='binary' needs pos_label when the labels are not {0, 1};
# pass whichever label (e.g. '706') counts as the positive class
sklearn.metrics.f1_score(y_test, pred, average='binary', pos_label='706')
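The difference between the averaging modes can be seen on a tiny hypothetical example using the same two codes:

```python
from sklearn.metrics import f1_score

# Made-up labels: 2 of class '705', 3 of class '706', one mistake.
y_true = ['705', '705', '706', '706', '706']
y_pred = ['705', '706', '706', '706', '706']
print(f1_score(y_true, y_pred, average=None))        # one score per class
print(f1_score(y_true, y_pred, average='weighted'))  # support-weighted mean
print(f1_score(y_true, y_pred, average='binary', pos_label='706'))  # score for '706' only
```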
Lime explainers assume that classifiers act on raw text, but sklearn classifiers act on vectorized representations of text. To bridge the two, we use sklearn's pipeline, which implements predict_proba on lists of raw text.
from lime import lime_text
from sklearn.pipeline import make_pipeline
c = make_pipeline(vectorizer, rf)
print(c.predict_proba([X_test.iloc[0]]))
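A self-contained sketch of the same pipeline idea on a made-up toy corpus (the documents and labels here are hypothetical):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical two-class corpus standing in for the claim descriptions.
docs = ['knee pain claim', 'knee surgery claim',
        'dental cleaning claim', 'dental filling claim']
labels = ['705', '705', '706', '706']

pipe = make_pipeline(TfidfVectorizer(lowercase=False),
                     RandomForestClassifier(n_estimators=50, random_state=0))
pipe.fit(docs, labels)

# The pipeline vectorizes internally, so predict_proba accepts raw strings,
# which is exactly what LIME requires.
probs = pipe.predict_proba(['knee injury claim'])
print(probs.shape)  # (1, 2): one row, one probability per class
```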
Now we create an explainer object. We pass class_names as an argument for prettier display.
from lime.lime_text import LimeTextExplainer
class_names = ['705','706']
explainer = LimeTextExplainer(class_names=class_names)
We then generate an explanation with at most 6 features for an arbitrary document in the test set.
idx = 83
exp = explainer.explain_instance(X_test.iloc[idx], c.predict_proba, num_features=6)
print('Document id: %d' % idx)
print('Probability(706) =', c.predict_proba([X_test.iloc[idx]])[0,1])
print('True class: %s' % y_test.iloc[idx])
print(y_test.iloc[idx])
print(X_test.iloc[idx])
exp.as_list()
From the lime docs: these weighted features are a linear model which approximates the behaviour of the random forest classifier in the vicinity of the test example. Roughly, if we remove 'transactions' and 'state' from the document, the prediction should move towards the opposite class by about the sum of the weights for both features. Let's see if this is the case.
print('Original prediction:', rf.predict_proba(test_vectors[idx])[0,1])
tmp = test_vectors[idx].copy()
tmp[0,vectorizer.vocabulary_['transactions']] = 0
tmp[0,vectorizer.vocabulary_['state']] = 0
print('Prediction removing some features:', rf.predict_proba(tmp)[0,1])
print('Difference:', rf.predict_proba(tmp)[0,1] - rf.predict_proba(test_vectors[idx])[0,1])
Those features probably were not worth much. I should try stop-word removal or n-grams.
The explanations can be returned as a matplotlib barplot:
fig = exp.as_pyplot_figure()
The explanations can also be exported as an HTML page (which we can render here in this notebook), using D3.js to render graphs.
exp.show_in_notebook(text=False)
Alternatively, we can save the fully contained html page to a file:
exp.save_to_file('/tmp/oi.html')
Finally, we can also include a visualization of the original document, with the words in the explanation highlighted, to see which words affect the classifier the most.
exp.show_in_notebook(text=True)
from sklearn.model_selection import train_test_split
X = df['clm']
y = df['code']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)
Let's use the TF-IDF vectorizer, commonly used for text.
from sklearn import feature_extraction
vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(lowercase=False)
train_vectors = vectorizer.fit_transform(X_train)
test_vectors = vectorizer.transform(X_test)
Now, let's say we want to use random forests for classification. It's usually hard to understand what random forests are doing, especially with many trees.
rf = sklearn.ensemble.RandomForestClassifier(n_estimators=500)
rf.fit(train_vectors, y_train)
from sklearn.metrics import roc_curve, auc, f1_score, recall_score, precision_score
pred = rf.predict(test_vectors)
sklearn.metrics.f1_score(y_test, pred, average=None)
Lime explainers assume that classifiers act on raw text, but sklearn classifiers act on vectorized representations of text. To bridge the two, we use sklearn's pipeline, which implements predict_proba on lists of raw text.
from lime import lime_text
from sklearn.pipeline import make_pipeline
c = make_pipeline(vectorizer, rf)
print(c.predict_proba([X_test.iloc[0]]))
Now we create an explainer object. We pass class_names as an argument for prettier display.
from lime.lime_text import LimeTextExplainer
class_names = ['705','706']
explainer = LimeTextExplainer(class_names=class_names)
We then generate an explanation with at most 6 features for an arbitrary document in the test set.
idx = 83
exp = explainer.explain_instance(X_test.iloc[idx], c.predict_proba, num_features=6)
print('Document id: %d' % idx)
print('Probability(706) =', c.predict_proba([X_test.iloc[idx]])[0,1])
print('True class: %s' % y_test.iloc[idx])
print(y_test.iloc[idx])
print(X_test.iloc[idx])
exp.as_list()
From the lime docs: these weighted features are a linear model which approximates the behaviour of the random forest classifier in the vicinity of the test example. Roughly, if we remove 'provider' and 'state' from the document, the prediction should move towards the opposite class by about the sum of the weights for both features. Let's see if this is the case.
print('Original prediction:', rf.predict_proba(test_vectors[idx])[0,1])
tmp = test_vectors[idx].copy()
tmp[0,vectorizer.vocabulary_['provider']] = 0
tmp[0,vectorizer.vocabulary_['state']] = 0
print('Prediction removing some features:', rf.predict_proba(tmp)[0,1])
print('Difference:', rf.predict_proba(tmp)[0,1] - rf.predict_proba(test_vectors[idx])[0,1])
Those features probably were not worth much. I should try stop-word removal or n-grams.
The explanations can be returned as a matplotlib barplot:
fig = exp.as_pyplot_figure()
The explanations can also be exported as an HTML page (which we can render here in this notebook), using D3.js to render graphs.
exp.show_in_notebook(text=False)
Alternatively, we can save the fully contained html page to a file:
exp.save_to_file('/tmp/oi_claim.html')
Finally, we can also include a visualization of the original document, with the words in the explanation highlighted, to see which words affect the classifier the most.
exp.show_in_notebook(text=True)